1. Data Background and Problem Formulation

In this project, we are given a dataset of 900+ records, each corresponding to an individual indicator from user-experience surveys about proposed system designs and how potential users would react to or perceive them.

The goal is to build a deep learning model with an acceptable AUC (0.7 to 0.9) on the test dataset, so that it can make clear predictions on the label columns.

To gain a better sense of the data, we first import the necessities to run the overall code.


1.1. Importing the data

We can then import the data via the pandas package to see the overall nature of the data, the outputs, and the datatypes.
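A minimal sketch of this step. The miniature DataFrame below stands in for the real survey file (whose name is not shown here); only the Requirements and boring<==>exciting column names come from the text, the rest is illustrative.

```python
import pandas as pd

# Stand-in for the survey data; in the notebook this would be a pd.read_csv() call.
df = pd.DataFrame({
    "Requirements": ["The system shall export reports",
                     "The system shall support dark mode"],
    "boring<==>exciting": [3, 6],
})
print(df.shape)    # overall size: (rows, columns)
print(df.dtypes)   # object for the text column, int64 for the rating
df.head()          # first rows, to eyeball the data
```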

1.2. Column Definitions

1.2.1. Features

Features are represented by only the first column, the Requirements column, which contains the question and/or the functionalities of the system itself.

1.2.2. Labels

Labels, on the other hand, comprise 8 columns in total. They are:

The scales on the labels vary from 1 to 7: values closer to 1 correspond to a more negative connotation, while values closer to 7 are more positive.

For example, for the boring<==>exciting column, if a user gives a 3, then although this may seem neutral at first glance, we can infer that the user does not regard this system design as particularly exciting -- otherwise they would give it a score of 5 or 6.

For this task, we'll run multi-class, multi-label regression: we previously obtained a good AUC with simple regression on a single column, so the trick here is to not restrict ourselves to one column.

With this, we approach the problem within the scope of multi-output regression.

First, we separate the labels from the data.
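Under the column layout described above, the separation is a single pandas split. The DataFrame here is a stand-in, and the annoying<==>enjoyable column name is an assumption for illustration:

```python
import pandas as pd

# Miniature stand-in; the real data has the Requirements column plus 8 label columns.
df = pd.DataFrame({
    "Requirements": ["export reports", "support dark mode"],
    "boring<==>exciting": [3, 6],
    "annoying<==>enjoyable": [2, 5],   # hypothetical label column name
})
X = df["Requirements"]                  # features: the requirement text
y = df.drop(columns=["Requirements"])   # labels: every rating column
```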

We know from previous work that the dataset is heavily imbalanced, so we want to oversample across all columns and use the result as our new labels. Before that, we need to convert the features into word vectors.

1.3. Quick EDA

We can visualize the most used words via wordcloud used in our dataset.
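The word cloud is driven by word frequencies, so the counting step is the substance of this EDA; a dependency-free sketch of it, on a stand-in corpus (the wordcloud package then visualizes counts like these):

```python
import re
from collections import Counter

texts = ["The system shall export reports",
         "The system shall support dark mode"]  # stand-in corpus

# lowercase, keep only alphabetic tokens, then count occurrences
words = re.findall(r"[a-z]+", " ".join(texts).lower())
freq = Counter(words)
print(freq.most_common(5))  # the most used words, largest in the word cloud
```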

2. Pre-processing


2.1. Tokenization

Tokenization means splitting text into words and vectorizing the corpus by turning each text into a sequence of integers, each integer being the index of a token in a dictionary. This process also cleans up the text: lowercasing it and removing punctuation. The step is essentially the same as in our previous work involving LSTMs.

This way, we can see which words appear the most in our training set.

After tokenization, the next step is to turn those tokens into lists of sequences.
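A pure-Python sketch of what the Keras Tokenizer does under the hood -- lowercase, strip punctuation, build a word-to-index dictionary (most frequent word gets the lowest index; 0 is reserved for padding), then map each text to token ids:

```python
import re
from collections import Counter

texts = ["This system is easy to use.",
         "This system is BORING!"]  # stand-in training texts

# lowercase + strip punctuation, like the Keras Tokenizer's default filters
words_per_text = [re.findall(r"[a-z]+", t.lower()) for t in texts]

# word -> integer index, ordered by frequency; index 0 is reserved for padding
counts = Counter(w for ws in words_per_text for w in ws)
word_index = {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

# each text becomes a sequence of token ids
sequences = [[word_index[w] for w in ws] for ws in words_per_text]
```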

2.2. Padding

When we train neural networks for NLP, all sequences need to be the same size, which is why we use padding. Padding converts every instance of the training data to the same length; to do this, we specify a fixed length to transform all instances to.

For the project, we can set the sequence length by quickly inspecting the typical length of the training sequences.

We can then proceed to do the actual padding. This is done for all instances of the training example.
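A small sketch of the padding step, mimicking the behavior of Keras' pad_sequences with padding='post' and truncating='post' (the fixed length of 3 is illustrative):

```python
def pad_post(sequences, maxlen):
    """Post-truncate to maxlen, then post-pad with zeros up to maxlen."""
    return [seq[:maxlen] + [0] * (maxlen - len(seq[:maxlen])) for seq in sequences]

padded = pad_post([[1, 2], [1, 2, 3, 4]], maxlen=3)
# [[1, 2, 0], [1, 2, 3]]
```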

2.3. ROS Oversampling

Generally, oversampling generates synthetic data so that the minority labels are not squashed by the majority.

A constraint of using SMOTE on this dataset is that it is fragile with classes that have only one example: since SMOTE works via k-nearest neighbors, a class needs at least two members for it to work. For this project, we're going to use ROS (Random Oversampling) instead.

ROS duplicates existing samples at random, whereas SMOTE interpolates between neighbors to generate "more credible" synthetic data. We can't use SMOTE on all sheets because of its n_neighbors and distance constraint.

We find the least common multiple of the class counts and oversample each class to that amount, so the matrix dimensions stay uniform.

Synthetic samples are generated with respect to the largest class count across all columns, so that the input shapes (features and labels) stay consistent with each other.
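A dependency-free numpy sketch of random oversampling (imbalanced-learn's RandomOverSampler implements the same idea with more options): every class is resampled with replacement up to a common target count, so even a single-member class poses no problem.

```python
import numpy as np

rng = np.random.default_rng(42)

def random_oversample(X, y, target):
    """Randomly duplicate samples of each class until every class has `target` members."""
    X, y = np.asarray(X), np.asarray(y)
    Xs, ys = [], []
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        picked = rng.choice(idx, size=target, replace=True)  # sample with replacement
        Xs.append(X[picked])
        ys.append(y[picked])
    return np.concatenate(Xs), np.concatenate(ys)

# imbalanced toy data: 6 / 3 / 1 members per class
X = np.arange(10).reshape(-1, 1)
y = np.array([1, 1, 1, 1, 1, 1, 2, 2, 2, 3])
X_res, y_res = random_oversample(X, y, target=6)
```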

2.4. Train-test Splitting

Finally, we get to the split. We can use sklearn's train_test_split to carve out the test set as usual (the validation set is specified when building the model in the next section).

Also, specifying random_state ensures reproducibility.
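The split on toy arrays; the test_size and random_state values here are assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(-1, 1)  # toy features
y = np.arange(20)                 # toy labels

# random_state pins the shuffle, so re-running gives the identical split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```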

3. Modelling

3.1. Building the Model

Since we're training on augmented data, we approach this with a conventional regression model instead of the LSTM approach we used before.

An LSTM model here would yield a fairly poor metric score (around 0.5 to 0.7 F1-score), whereas our fully-connected dense layers produce a satisfying F1 of around 0.8 to 1 for each column.
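The exact architecture isn't shown in this text, so here is a minimal tf.keras sketch of a fully-connected multi-output regression head; the layer sizes, sequence length, and loss choice are all assumptions:

```python
import numpy as np
import tensorflow as tf

NUM_LABELS = 8   # one linear output per rating column
MAXLEN = 20      # padded sequence length (assumed value)

# Plain dense layers instead of an LSTM; the final layer has no activation,
# so each of the 8 outputs is a raw regression score for one label column.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(MAXLEN,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(NUM_LABELS),
])
model.compile(optimizer="adam", loss="mse", metrics=["mae"])

out = model(np.zeros((2, MAXLEN), dtype="float32"))  # sanity-check the output shape
```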

3.2. Result Interpretation and Evaluation

Here we evaluate on y_test, which we made sure earlier comes from a distribution similar to the training set's.

This yields reassuring MAE and MSE values, which is a good result considering the number of classes we have.

3.3. Prepare the Plotting Set

Due to the multi-dimensional nature of our results and test set, we need to reshape the predictions using numpy's flatten() method, which flattens the array column-wise when we pass order='F' (Fortran-style order).

Keep in mind this will aggregate the metrics in the classification_report() below.

After that, we round each result to the nearest integer, because we originally tackled this as a regression task. For example, if a prediction is 4.7, it is rounded to 5 and therefore predicted as class 5.
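Both steps on a toy prediction matrix:

```python
import numpy as np

# toy predictions: 2 samples x 2 label columns
y_pred = np.array([[4.7, 1.2],
                   [3.4, 6.6]])

flat = y_pred.flatten(order="F")      # column-wise (Fortran order): [4.7, 3.4, 1.2, 6.6]
classes = np.rint(flat).astype(int)   # round to the nearest integer class: [5, 3, 1, 7]
```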

Multi-class ordinal classification cannot be implemented with a standard fully-connected output layer, and implementing multi-headed layers is outside the scope of this project.

3.3.1. List Initialization

After plotting the ROC curve, we can initialize some empty lists to keep track of the evaluation metrics we're going to use.

We can also print classification report from each column, and append each resulting evaluation metric to the existing list we initialized.

Here we display the average of each evaluation metric by calling np.mean() on each array -- remember, each array is comprised of multiple evaluation metrics; one for each column.

3.4. Exporting the Model

We can export the model to the HDF5 format, which is designed to store large amounts of information. More info:
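A minimal export/reload round-trip, using a tiny stand-in model (the real model and filename differ):

```python
import tensorflow as tf

# stand-in model; in the notebook this is the trained dense network
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,)),
    tf.keras.layers.Dense(1),
])

model.save("perception_model.h5")  # the .h5 suffix selects the HDF5 format
reloaded = tf.keras.models.load_model("perception_model.h5")
```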

Our metric_df now looks like this for a single method. Note that we're looking for the best method, so let's continue working.

3.5. XGB Method

For comparison, we try implementing XGB.

3.5.1. Classifier Wrapper Function

For brevity, we can define a wrapper so that we only need to invoke one function when we want to plot the resulting predictions across all the different learning algorithms.

We need only to call this plot_result_model() function and pass in the specified keyword arguments to:

  1. Fit the model on the training features with .fit() and make predictions.
  2. Calculate corresponding evaluation metrics and save them to the lists initialized.
  3. Plot multi-class ROC.

So, we can expect that this wrapper function will be used to infer the results across different learning algorithms.

Keep in mind that the function plot_result_model() will in turn be wrapped within plot_all_result_model().
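The original wrapper isn't reproduced in this text, so here is a hypothetical reconstruction of its core (fit, predict, record metrics); the ROC plotting is omitted, and the metric-list names are assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

acc_list, f1_list = [], []  # evaluation-metric trackers (names assumed)

def plot_result_model(model, X_train, y_train, X_test, y_test):
    """Fit the model, predict on the test set, and record evaluation metrics.

    The real notebook version also plots the multi-class ROC, omitted here."""
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    acc_list.append(accuracy_score(y_test, preds))
    f1_list.append(f1_score(y_test, preds, average="macro"))
    return preds

# toy usage with a stand-in learner
X = np.arange(20, dtype=float).reshape(-1, 1)
y = (np.arange(20) > 9).astype(int)
preds = plot_result_model(LogisticRegression(), X[:16], y[:16], X[16:], y[16:])
```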

3.5.2. Average XGBoost ROC

Here we plot the average ROC for each class, found with XGBoost.

3.5.3. Average Classification Metrics

Here we display the average of each evaluation metric by calling np.mean() on each array -- remember, each array is comprised of multiple evaluation metrics; one for each column.

3.6. SVM Method

Another comparison is the SVM (Support Vector Machine) algorithm. We check whether this shallow learning algorithm can still predict correctly across all 8 classes compared to the two previous implementations.
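A minimal sklearn instantiation that plugs into the same wrapper; the kernel choice and probability=True (needed for predict_proba-based ROC curves) are assumptions:

```python
import numpy as np
from sklearn.svm import SVC

# toy 3-class data: well-separated Gaussian clusters
X = np.concatenate(
    [np.random.RandomState(2).normal(loc=c * 3, size=(10, 2)) for c in range(3)])
y = np.repeat([1, 2, 3], 10)

# probability=True enables predict_proba, which the ROC plotting relies on
svm = SVC(kernel="rbf", probability=True, random_state=42).fit(X, y)
proba = svm.predict_proba(X)  # one probability per class, per sample
```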

3.6.1. Average SVM ROC

Here we plot the average ROC for each class, found with SVM.

3.6.2. Average Classification Metrics

Here we display the average of each evaluation metric by calling np.mean() on each array -- remember, each array is comprised of multiple evaluation metrics; one for each column.

3.7. Custom-Built Ordinal Classifier

Credit: Muhammad for Towards Data Science

An ordinal classifier does something similar to one-hot encoding -- but instead of a single hot value per column, ordinal encoding maps each unique class to a unique binary pattern.

We implement a custom object called OrdinalClassifier built on top of any sklearn model that supports the predict_proba() function, from Naive Bayes to decision trees.

In the last implementation, we shifted the labels down by one, from [1,2,3,4,5,6,7] to [0,1,2,3,4,5,6], to match the zero-indexing of the standard sklearn library. But this doesn't work if the labels don't span the full 1-to-7 range -- they could run from 2 to 6, or from 2 to 7, instead.

So the solution is to change the behavior of the OrdinalClassifier object to take the maximum class (e.g. 7) and minimum class (e.g. 2 or 3) into consideration. We add the minimum class to the result of np.argmax() before returning it, and we also use the minimum class as an offset into the self.clfs dictionary when adding a new value, so we don't need to shift the labels by hand. In other words, the labels stay as-is instead of being decremented by 1 :)

This also changes how the ordinal encoding works -- instead of k - 1 encodings as in the author's original implementation, we feed k encodings because our labels are already one-indexed.

Keep in mind that the predict_proba() function does not yet implement this change, because it is only an intermediate step before predict() is called, and predict() is the result we care most about.
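Putting the pieces above together, a runnable sketch of such an ordinal classifier, following the threshold scheme the cited article is based on (one binary model per class boundary); details of the notebook's exact version may differ, and labels are used as-is whatever range they span:

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

class OrdinalClassifier:
    """Ordinal classifier sketch: one binary model per threshold P(y > k)."""

    def __init__(self, clf):
        self.clf = clf
        self.clfs = {}

    def fit(self, X, y):
        y = np.asarray(y)
        self.classes_ = np.sort(np.unique(y))  # actual labels, e.g. [2, 3, 5, 7]
        for k in self.classes_[:-1]:           # one binary target per class boundary
            self.clfs[k] = clone(self.clf).fit(X, (y > k).astype(int))
        return self

    def predict_proba(self, X):
        over = {k: c.predict_proba(X)[:, 1] for k, c in self.clfs.items()}  # P(y > k)
        ks = list(self.classes_)
        cols = []
        for i, k in enumerate(ks):
            if i == 0:
                cols.append(1.0 - over[ks[0]])           # P(y = min class)
            elif i == len(ks) - 1:
                cols.append(over[ks[-2]])                # P(y = max class)
            else:
                cols.append(over[ks[i - 1]] - over[ks[i]])
        return np.column_stack(cols)

    def predict(self, X):
        # argmax gives a position; map it back to the real label (no manual shifting)
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]

# toy ordinal data whose labels do NOT start at 1 and are not contiguous
X = np.arange(40, dtype=float).reshape(-1, 1)
y = np.repeat([2, 3, 5, 7], 10)
oc = OrdinalClassifier(LogisticRegression()).fit(X, y)
preds = oc.predict(X)
```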

3.7.1. Average Ordinal Classifier ROC

Here we plot the average ROC for each class, found with the custom-built Ordinal Classifier we made earlier.

4. Other Learning Algorithm(s)

Some of the learning algorithms we try and implement other than SVM, XGB, and fully-connected neural networks are as follows:

  1. Bidirectional LSTM
  2. Radial Basis Function Network (RBFN)
  3. Multiclass Logistic Regression
  4. LSTM
  5. Ensemble: Random Forest
  6. Ensemble: Extra Trees
  7. Ensemble: Adaboost
  8. Ensemble: Bagging Classifiers

4.1. Bidirectional LSTM

Bidirectional LSTMs are an extension of traditional LSTMs that can improve model performance on sequence classification problems. In problems where all timesteps of the input sequence are available, Bidirectional LSTMs train two LSTMs instead of one on the input sequence.

Bidirectional LSTMs basically look both ways when training on sequence-based data.
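A minimal tf.keras sketch of such a model; the vocabulary size, embedding width, unit counts, and sequence length are all assumptions:

```python
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),                       # padded token-id sequences
    tf.keras.layers.Embedding(1000, 16),                      # token ids -> dense vectors
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),  # reads the sequence both ways
    tf.keras.layers.Dense(8),                                 # one output per label column
])

out = model(np.zeros((2, 20), dtype="int32"))  # sanity-check the output shape
```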

4.1.1. Interpretation

4.1.2. List Initialization

Like before, we can initialize some empty lists to keep track of the evaluation metrics we're going to use.

4.2. RBFN

In the field of mathematical modeling, a radial basis function network (abbreviated RBFN) is an artificial neural network that uses radial basis functions as activation functions. The output of the network is a linear combination of radial basis functions of the inputs and neuron parameters.

Here, we try to construct a custom RBF class which can then be called like so:

model = Sequential()
model.add(RBFLayer(10,
                   initializer=InitCentersRandom(X),
                   betas=1.0,
                   input_shape=(1,)))
model.add(Dense(1))

After defining our RBF class, we can start to build our model based on the RBFLayer class.

4.2.1. List Initialization

Like before, we can initialize some empty lists to keep track of the evaluation metrics we're going to use.

4.2.2. Average Classification Metrics

Here we display the average of each evaluation metric by calling np.mean() on each array -- remember, each array is comprised of multiple evaluation metrics; one for each column.

4.3. Multiclass Logistic Regression

The multinomial logistic regression algorithm, an extension of the logistic regression model, changes the loss function to cross-entropy loss and the predicted probability distribution to a multinomial distribution, so that it natively supports multi-class classification problems.
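A minimal sklearn example on toy data; with multi-class targets and the default lbfgs solver, LogisticRegression fits a multinomial (softmax) model, so each sample gets one probability distribution over all classes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy 3-class data: well-separated Gaussian clusters
X = np.concatenate(
    [np.random.RandomState(0).normal(loc=c * 3, size=(20, 2)) for c in range(3)])
y = np.repeat([1, 2, 3], 20)

clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X)  # one multinomial distribution per sample
```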

4.3.1. Average Multinomial Logistic Regression ROC

Here we plot the average ROC for each class, found with Multinomial Logistic Regression.

4.4. LSTM

Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of RNN, capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997), and were refined and popularized by many people in following work. They work tremendously well on a large variety of problems, and are now widely used.

LSTMs are explicitly designed to avoid the long-term dependency problem.

4.4.1. Result Interpretation and Evaluation

Here we evaluate on y_test, which we made sure earlier comes from a distribution similar to the training set's.

4.4.2. List Initialization

Like before, we can initialize some empty lists to keep track of the evaluation metrics we're going to use.

4.5. Ensemble

In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.

In this project, ensemble algorithms we can try out are as follows:

  1. Random Forest
  2. Extra Trees
  3. AdaBoost
  4. Bagging Classifier

We've already implemented gradient boosting via XGB on Section 3.5., so we think it's redundant to implement it all over again.
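All four ensembles share the sklearn estimator API, so they can be compared in one loop; the toy data and hyperparameters below are assumptions for illustration:

```python
import numpy as np
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              AdaBoostClassifier, BaggingClassifier)

# toy 2-class data: two well-separated Gaussian clusters
X = np.concatenate(
    [np.random.RandomState(1).normal(loc=c * 4, size=(15, 2)) for c in range(2)])
y = np.repeat([0, 1], 15)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "Extra Trees": ExtraTreesClassifier(n_estimators=50, random_state=42),
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "Bagging": BaggingClassifier(random_state=42),
}
# fit each ensemble and record its training accuracy
scores = {name: m.fit(X, y).score(X, y) for name, m in models.items()}
```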

4.5.1. Random Forest Classifier

Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time. For classification tasks, the output of the random forest is the class selected by most trees.

4.5.1.1. Average Random Forest ROC

Here we plot the average ROC for each class, found with our Random Forest.

4.5.2. Extra Trees Classifier

The Extremely Randomized ("Extra") Trees Classifier is a type of ensemble learning technique which aggregates the results of multiple de-correlated decision trees, collected in a “forest”, to output its classification result.

4.5.2.1. Average Extra Trees Classifier ROC

Here we plot the average ROC for each class, found with our Extra Trees Classifier.

4.5.3. AdaBoost

AdaBoost, short for Adaptive Boosting, is a statistical classification meta-algorithm formulated by Yoav Freund and Robert Schapire, who won the 2003 Gödel Prize for their work. It can be used in conjunction with many other types of learning algorithms to improve performance.

4.5.3.1. Average AdaBoost ROC

Here we plot the average ROC for each class, found with our AdaBoost.

4.5.4. Bagging Classifiers

A Bagging classifier is an ensemble meta-estimator that fits base classifiers, each on a random subset of the original dataset, and then aggregates their individual predictions (either by voting or by averaging) to form a final prediction. Such a meta-estimator can typically be used to reduce the variance of a black-box estimator (e.g., a decision tree) by introducing randomization into its construction procedure and then making an ensemble out of it.

4.5.4.1. Average Bagging Classifier ROC

Here we plot the average ROC for each class, found with our Bagging Classifier.

5. Metric Display